Meta-Classifier Approaches to Reliable Text Classification

Author

  • A. M. Kaptein

Abstract

A problem with automatic classifiers is that there is no way to know whether a particular classification is just a guess or a certain answer. Reliable classification is the task of predicting whether a given instance is correctly classified, i.e., each classification is itself labelled as either reliable or unreliable. Labelling a classification as unreliable is like saying “I do not know”, and the instance then receives no classification.

Given a base classifier, the meta-classifier approach trains a meta-classifier that predicts the correctness of each classification made by the base classifier. Its classification rule is to assign the class predicted by the base classifier to an instance only if the meta-classifier decides that the base classification is reliable. The meta-classifier approach is applied to text classification tasks provided by the CBS to answer the following problem statement: Does the meta-classifier approach provide a practical solution to reliable text classification?

The first part of the research studies text classifiers and answers the first research question: 1. Which text classifiers achieve high accuracy and at the same time have small space and time complexity? Experiments on the CBS datasets show that the nearest neighbour and naïve Bayes algorithms, in combination with the tf-idf text representation, are acceptable text classifiers.

The second part of the research studies the meta-classifier approach to answer the second and third research questions. Our second research question is: 2. What type of metadata representation is best suited for reliable text classification? The meta-classifier is trained on several types of metadata representations. The metadata representations used include the original instances, the probability distribution of the base classifier, and a set of basic statistics about the classification of the base classifier. For tasks with many classes, the original instances representation is best. For tasks with a small number of classes, the original instances representation and the probability distribution are both good candidates.
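
The classification rule described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the word-count base classifier, the held-out set used to train the meta-classifier, the single-threshold meta-classifier on the top class probability, and all function names are assumptions made only for illustration.

```python
from collections import Counter

def train_base(docs, labels):
    """Toy base classifier (assumption): word counts per class."""
    counts = {}
    for doc, label in zip(docs, labels):
        counts.setdefault(label, Counter()).update(doc.split())
    return counts

def base_distribution(counts, doc):
    """Normalised per-class scores with add-one smoothing."""
    words = doc.split()
    scores = {c: 1 + sum(wc[w] for w in words) for c, wc in counts.items()}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

def base_predict(counts, doc):
    """Return the predicted class and its probability."""
    dist = base_distribution(counts, doc)
    label = max(dist, key=dist.get)
    return label, dist[label]

def train_meta(counts, docs, labels):
    """Toy meta-classifier (assumption): one threshold on the top probability.

    Each held-out document yields a meta-instance (top probability,
    was the base prediction correct?). The threshold that best separates
    correct from incorrect base predictions is returned.
    """
    rows = []
    for doc, true_label in zip(docs, labels):
        predicted, top = base_predict(counts, doc)
        rows.append((top, predicted == true_label))

    def accuracy(threshold):
        return sum((top >= threshold) == ok for top, ok in rows) / len(rows)

    return max(sorted({top for top, _ in rows}), key=accuracy)

def reliable_classify(counts, threshold, doc):
    """Return the base class if judged reliable, otherwise None ("I do not know")."""
    label, top = base_predict(counts, doc)
    return label if top >= threshold else None

# Made-up training data for the base classifier.
base_model = train_base(
    ["tax income tax report", "income budget tax",
     "match goal team", "team goal goal score"],
    ["finance", "finance", "sport", "sport"],
)
# Made-up held-out data for the meta-classifier.
meta_threshold = train_meta(
    base_model,
    ["tax budget", "goal score", "team report", "income match"],
    ["finance", "sport", "finance", "sport"],
)

print(reliable_classify(base_model, meta_threshold, "tax income"))
print(reliable_classify(base_model, meta_threshold, "team budget"))
```

Here the meta-classifier sees only one statistic derived from the base classifier's probability distribution. The other metadata representations named in the abstract (the original instances, or a larger set of basic statistics) would replace the single top-probability feature with a richer meta-instance.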

Similar resources

Using Fuzzy LR Numbers in Bayesian Text Classifier for Classifying Persian Text Documents

Text classification is an important research field in information retrieval and text mining. The main task in text classification is to assign text documents to predefined categories based on the documents’ contents and labeled training samples. Since word detection is a difficult and time-consuming task in the Persian language, a Bayesian text classifier is an appropriate approach to deal with different...

Meta-Classification using SVM Classifiers for Text Documents

Text categorization is the problem of classifying text documents into a set of predefined classes. In this paper, we investigated three approaches to building a meta-classifier in order to increase classification accuracy. The basic idea is to learn a meta-classifier that optimally selects the best component classifier for each data point. The experimental results show that combining classifiers c...

Improving Text Classification Quality Using a Two-Level Classifier Committee

Automated text classification has gained special importance due to the increasing availability of documents in digital form and the ensuing need to organize them. Although this problem belongs to the Information Retrieval (IR) field, the dominant approach is based on machine learning techniques. Approaches based on classifier committees have shown better performance than the others. I...

A New Approach for Text Documents Classification with Invasive Weed Optimization and Naive Bayes Classifier

With the rapid growth in the number of documents, Text Document Classification (TDC) methods have become crucial. This paper presents a hybrid model of Invasive Weed Optimization (IWO) and a Naive Bayes (NB) classifier (IWO-NB) for Feature Selection (FS) in order to reduce the large size of the feature space in TDC. TDC includes different actions such as text processing, feature extraction, form...

Publication date: 2005